19 research outputs found

    Morrigan: A Composite Instruction TLB Prefetcher

    Get PDF
    The effort to reduce address translation overheads has typically targeted data accesses since they constitute the overwhelming portion of the second-level TLB (STLB) misses in desktop and HPC applications. The address translation cost of instruction accesses has been relatively neglected due to historically small instruction footprints. However, state-of-the-art datacenter and server applications feature massive instruction footprints owing to deep software stacks, resulting in high STLB miss rates for instruction accesses. This paper demonstrates that instruction address translation is a performance bottleneck in server workloads. In response, we propose Morrigan, a microarchitectural instruction STLB prefetcher whose design is based on new insights regarding instruction STLB misses. At the core of Morrigan there is an ensemble of table-based Markov prefetchers that build and store variable length Markov chains out of the instruction STLB miss stream. Morrigan further employs a sequential prefetcher and a scheme that exploits page table locality to maximize miss coverage. An important contribution of the work is showing that access frequency is more important than access recency when choosing replacement candidates. Based on this insight, Morrigan introduces a new replacement policy that identifies victims in the Markov prefetchers using a frequency stack while adapting to phase-change behavior. On a set of 45 industrial server workloads, Morrigan eliminates 69% of the memory references in demand page walks triggered by instruction STLB misses and improves geometric mean performance by 7.6%.This work is partially supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project, the Generalitat de Catalunya (contract 2017-SGR-1414), the NSF grant CCF-1912617, the Semiconductor Research Corporation grant 2936.001, and generous gifts from Intel Labs. Georgios Vavouliotis has been supported by the Spanish Ministry of Economy, Industry and Competitiveness and the European Social Fund under the FPI fellowship No. PRE2018-087046. Marc Casas has been supported by the Spanish Ministry of Economy, Industry and Competitiveness under the Ramon y Cajal fellowship No. RYC-2017-23269.Peer ReviewedPostprint (author's final draft

    Morrigan: A composite instruction TLB prefetcher

    Get PDF
    The effort to reduce address translation overheads has typically targeted data accesses since they constitute the overwhelming portion of the second-level TLB (STLB) misses in desktop and HPC applications. The address translation cost of instruction accesses has been relatively neglected due to historically small instruction footprints. However, state-of-the-art datacenter and server applications feature massive instruction footprints owing to deep software stacks, resulting in high STLB miss rates for instruction accesses. This paper demonstrates that instruction address translation is a performance bottleneck in server workloads. In response, we propose Morrigan, a microarchitectural instruction STLB prefetcher whose design is based on new insights regarding instruction STLB misses. At the core of Morrigan there is an ensemble of table-based Markov prefetchers that build and store variable length Markov chains out of the instruction STLB miss stream. Morrigan further employs a sequential prefetcher and a scheme that exploits page table locality to maximize miss coverage. An important contribution of the work is showing that access frequency is more important than access recency when choosing replacement candidates. Based on this insight, Morrigan introduces a new replacement policy that identifies victims in the Markov prefetchers using a frequency stack while adapting to phase-change behavior. On a set of 45 industrial server workloads, Morrigan eliminates 69% of the memory references in demand page walks triggered by instruction STLB misses and improves geometric mean performance by 7.6%.This work is partially supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project, the Generalitat de Catalunya (contract 2017-SGR-1414), the NSF grant CCF-1912617, the Semiconductor Research Corporation grant 2936.001, and generous gifts from Intel Labs. Georgios Vavouliotis has been supported by the Spanish Ministry of Economy, Industry and Competitiveness and the European Social Fund under the FPI fellowship No. PRE2018-087046. Marc Casas has been supported by the Spanish Ministry of Economy, Industry and Competitiveness under the Ramon y Cajal fellowship No. RYC-2017-23269.Peer ReviewedPostprint (author's final draft

    Peachy Parallel Assignments (EduHPC 2018)

    Get PDF
    Peachy Parallel Assignments are a resource for instructors teaching parallel and distributed programming. These are high-quality assignments, previously tested in class, that are readily adoptable. This collection of assignments includes implementing a subset of OpenMP using pthreads, creating an animated fractal, image processing using histogram equalization, simulating a storm of high-energy particles, and solving the wave equation in a variety of settings. All of these come with sample assignment sheets and the necessary starter code.Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos)Facilitar la inclusión de ejercicios prácticos de programación paralela en cursos de Computación Paralela o de alto rendimiento (HPC)Comunicación en congreso: Descripción de ejercicios prácticos con acceso a material ya desarrollado y probado

    Adaptive power shifting for power-constrained heterogeneous systems

    Get PDF
    The number and heterogeneity of compute devices, even within a single compute node, has been steadily on the rise. Since all systems must operate under a power cap, the number of discrete devices that can run simultaneously at their highest frequency is limited by the globally-imposed power cap. Current systems incorporate a centralized power management unit that statically controls the distribution of power among the devices within the node. However, such static distribution policies are unaware of the dynamic utilization profile across the devices, which leads to unfair power allocations that end up degrading system throughput performance. The problem is particularly acute in the presence of heterogeneity since type-specific performance-boost capabilities cannot be leveraged via utilization-agnostic static power allocations. This paper proposes Adaptive Power Shifting for multi-accelerator heterogeneous systems (APS), a technique that leverages system utilization information to dynamically allocate and re-distribute power budgets across multiple discrete devices. Democratizing the power allocation based on dynamic needs results in dramatic speedup over a need-agnostic static allocation. We use APS in a real OpenPOWER compute node with 2 CPUs and 4 GPUs to demonstrate the value of on-demand, equitable power allocations. Overall, the proposed solution increases performance with respect to two state-of-the-art techniques by up to 14.9% and 13.8%.This work has been partially supported by the European Union’s Horizon 2020 research and innovation program under the Mont-Blanc 2020 project (grant agreement 779877), by the Spanish Ministry of Science and Innovation (contract PID2019-107255GB-C22), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the IBM/BSC Deep Learning Center initiative. Ll. Alvarez has been supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under the Juan de la Cierva Formacion fellowship No. FJCI-2016- 30984. M. Moreto has been supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship No. RYC-2016-21104.Peer ReviewedPostprint (author's final draft

    OpenCL-based FPGA accelerator for semi-global approximate string matching using diagonal bit-vectors

    Get PDF
    An FPGA accelerator for the computation of the semi-global Levenshtein distance between a pattern and a reference text is presented. The accelerator provides an important benefit to reduce the execution time of read-mappers used in short-read genomic sequencing. Previous attempts to solve the same problem in FPGA use the Myers algorithm following a column approach to compute the dynamic programming table. We use an approach based on diagonals that allows for some resource savings while maintaining a very high throughput of 1 alignment per clock cycle. The design is implemented in OpenCL and tested on two FPGA accelerators. The maximum performance obtained is 91.5 MPairs/s for 100 × 120 sequences and 47 MPairs/s for 300 × 360 sequences, the highest ever reported for this problem.This research was supported by the EU Regional Development Fund under the DRAC project [001-P-001723], by the MINECO-Spain (contract TIN2017-84553-C2-1-R), by the MICIU-Spain (contract RTI2018-095209-B-C22) and by the Catalan government (contracts 2017-SGR-1624, 2017-SGR313, 2017-SGR-1328). M.M. was partially supported by the MINECO under RYC-2016-21104. We thank Intel for granting us access to the DevCloud system and let us join the HARP research program. The presented HARP-2 results were obtained on resources hosted at the Paderborn Center for Parallel Computing (PC2) in the Intel Hardware Accelerator Research Program (HARP2).Peer ReviewedPostprint (author's final draft

    Observation of second sound in a rapidly varying temperature field in Ge

    Get PDF
    Second sound is known as the thermal transport regime where heat is carried by temperature waves. Its experimental observation was previously restricted to a small number of materials, usually in rather narrow temperature windows. We show that it is possible to overcome these limitations by driving the system with a rapidly varying temperature field. This effect is demonstrated in bulk Ge between 7 kelvin and room temperature, studying the phase lag of the thermal response under a harmonic high frequency external thermal excitation, addressing the relaxation time and the propagation velocity of the heat waves. These results provide a new route to investigate the potential of wave-like heat transport in almost any material, opening opportunities to control heat through its oscillatory nature.Comment: After careful revision we have ruled out the presence of coherent noise and from any other noise source within the reported data. We have updated the manuscript providing a detailed analysis of the photoreflectance signal, demonstrating with experiments its thermal origi

    The DeepHealth Toolkit: A Unified Framework to Boost Biomedical Applications

    Get PDF
    Given the overwhelming impact of machine learning on the last decade, several libraries and frameworks have been developed in recent years to simplify the design and training of neural networks, providing array-based programming, automatic differentiation and user-friendly access to hardware accelerators. None of those tools, however, was designed with native and transparent support for Cloud Computing or heterogeneous High-Performance Computing (HPC). The DeepHealth Toolkit is an open source Deep Learning toolkit aimed at boosting productivity of data scientists operating in the medical field by providing a unified framework for the distributed training of neural networks, which is able to leverage hybrid HPC and cloud environments in a transparent way for the user. The toolkit is composed of a Computer Vision library, a Deep Learning library, and a front-end for non-expert users; all of the components are focused on the medical domain, but they are general purpose and can be applied to any other field. In this paper, the principles driving the design of the DeepHealth libraries are described, along with details about the implementation and the interaction between the different elements composing the toolkit. Finally, experiments on common benchmarks prove the efficiency of each separate component and of the DeepHealth Toolkit overall

    Famílies botàniques de plantes medicinals

    Get PDF
    Facultat de Farmàcia, Universitat de Barcelona. Ensenyament: Grau de Farmàcia, Assignatura: Botànica Farmacèutica, Curs: 2013-2014, Coordinadors: Joan Simon, Cèsar Blanché i Maria Bosch.Els materials que aquí es presenten són els recull de 175 treballs d’una família botànica d’interès medicinal realitzats de manera individual. Els treballs han estat realitzat per la totalitat dels estudiants dels grups M-2 i M-3 de l’assignatura Botànica Farmacèutica durant els mesos d’abril i maig del curs 2013-14. Tots els treballs s’han dut a terme a través de la plataforma de GoogleDocs i han estat tutoritzats pel professor de l’assignatura i revisats i finalment co-avaluats entre els propis estudiants. L’objectiu principal de l’activitat ha estat fomentar l’aprenentatge autònom i col·laboratiu en Botànica farmacèutica

    Characterizing the impact of last-level cache replacement policies on big-data workloads

    Full text link
    In recent years, graph-processing has become an essential class of workloads with applications in a rapidly growing number of fields. Graph-processing typically uses large input sets, often in multi-gigabyte scale, and data-dependent graph traversal methods exhibiting irregular memory access patterns. Recent work demonstrates that, due to the highly irregular memory access patterns of data-dependent graph traversals, state-of-the-art graph-processing workloads spend up to 80 % of the total execution time waiting for memory accesses to be served by the DRAM. The vast disparity between the Last Level Cache (LLC) and main memory latencies is a problem that has been addressed for years in computer architecture. One of the prevailing approaches when it comes to mitigating this performance gap between modern CPUs and DRAM is cache replacement policies. In this work, we characterize the challenges drawn by graph-processing workloads and evaluate the most relevant cache replacement policies.Comment: Extended abstract submitted to the 10th BSC Doctoral Symposiu
    corecore